Optimized Relative Lempel-Ziv Compression of Genomes

نویسندگان

Shanika Kuruppu

Simon J. Puglisi

Justin Zobel

چکیده

High-throughput sequencing technologies make it possible to rapidly acquire large numbers of individual genomes, which, for a given organism, vary only slightly from one to another. Such repetitive and large sequence collections are a unique challange for compression. In previous work we described the RLZ algorithm, which greedily parses each genome into factors, represented as position and length pairs, which identify the corresponding material in a reference genome. RLZ provides effective compression in a single pass over the collection, and the final compressed representation allows rapid random access to arbitrary substrings. In this paper we explore several improvements to the RLZ algorithm. We find that simple non-greedy parsings can significantly improve compression performance and discover a strong correlation between the starting positions of long factors and their positions in the reference. This property is computationally inexpensive to detect and can be exploited to improve compression by nearly 50% compared to the original RLZ encoding, while simultaneously providing faster decompression.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

RLZAP: Relative Lempel-Ziv with Adaptive Pointers

Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.’s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, ...

متن کامل

A Compressed Self-Index for Genomic Databases

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a varian...

متن کامل

CHICO: A Compressed Hybrid Index for Repetitive Collections

Indexing text collections to support pattern matching queries is a fundamental problem in computer science. New challenges keep arising as databases grow, and for repetitive collections, compressed indexes become relevant. To successfully exploit the regularities of repetitive collections different approaches have been proposed. Some of these are Compressed Suffix Array, Lempel-Ziv, and Grammar...

متن کامل

Fast Relative Lempel-Ziv Self-index for Similar Sequences

Recent advances in biotechnology and web technology are generating huge collections of similar strings. People now face the problem of storing them compactly while supporting fast pattern searching. One compression scheme called relative Lempel-Ziv compression uses textual substitutions from a reference text as follows: Given a (large) set S of strings, represent each string in S as a concatena...

متن کامل

Lempel-Ziv Dimension for Lempel-Ziv Compression

This paper describes the Lempel-Ziv dimension (Hausdorff like dimension inspired in the LZ78 parsing), its fundamental properties and relation with Hausdorff dimension. It is shown that in the case of individual infinite sequences, the Lempel-Ziv dimension matches with the asymptotical Lempel-Ziv compression ratio. This fact is used to describe results on Lempel-Ziv compression in terms of dime...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2011

Optimized Relative Lempel-Ziv Compression of Genomes

نویسندگان

چکیده

منابع مشابه

RLZAP: Relative Lempel-Ziv with Adaptive Pointers

A Compressed Self-Index for Genomic Databases

CHICO: A Compressed Hybrid Index for Repetitive Collections

Fast Relative Lempel-Ziv Self-index for Similar Sequences

Lempel-Ziv Dimension for Lempel-Ziv Compression

عنوان ژورنال:

اشتراک گذاری